knitr::opts_chunk$set(warning=FALSE)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(countrycode)
library(outliers)
library(caret)
Loading required package: ggplot2
Loading required package: lattice
library(cluster)
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(NbClust)
library("DMwR")
Loading required package: grid
library("RWeka")
Warning: package ‘RWeka’ was built under R version 4.3.2
library("C50")
library("rpart")
library("themis")
Warning: package ‘themis’ was built under R version 4.3.2Loading required package: recipes
Attaching package: ‘recipes’
The following object is masked from ‘package:stats’:
step
library(rattle)
Warning: package ‘rattle’ was built under R version 4.3.2Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
Warning: package ‘rpart.plot’ was built under R version 4.3.2
library(RColorBrewer)
Prediction of cyber security employees’ salaries based on 11 attributes
1.work_year
2.experience_level
3.employment_type
4.job_title
5.salary
6.salary_currency
7.salary_in_usd
8.employee_residence
9.remote_ratio
10.company_location
11.company_size
We are living in the “information age” or rather the “data age”, meaning that everything around us revolves around data. The data has become one of the most valuable assets that a person or an organisation can have, since it has a significant value, losing it will lead to significant damages. Thus, most of the attacks nowadays are directed toward the data. To guard against such damages, organisations have realised the importance of protecting their digital assets, leading them to hire cybersecurity specialists. This made cybersecurity gain popularity among people so there’s a growing tendency to study cybersecurity. Consequently this resulted in the emergence of plentiful professionals with various experience levels and skills in this field. As a result, organisations may find it difficult to decide a salary for job candidates solely based on the CV. also, since the attacks improve rapidly, organisations need to hire more employees in the far future to defend against such attacks but it’s not an easy matter to predict the future payroll which may hinders some of the organisation’s plans. Another issue arises when the decision makers in the organisation aren’t fully aware of the trends on salary and groups of employees and their differint needs. Their lack of awareness gives a chance for the competitor organisations to attract their employees to them by offering a better salary that match current trends.
Prediction of the cyber security employees’ salary categories (Very Low, Low, , High, Very High) using classification and description of data characteristics and behavior and grouping data using clustering methods.
Given the problems we discussed and In order to better understand this field, we decided to analyse a dataset of 1247 cybersecurity employees, containing information such as salary, job title, and experience level. Analysing this dataset can provide insightful predictions regarding the salary range of a cybersecurity employee and description of the cybersecurity market behavior by grouping the data, which can help in
https://www.kaggle.com/datasets/deepcontractor/cyber-security-salaries
dataset= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/salaries_cyber.csv"), header=TRUE)
View(dataset)
we will keep a copy of the original dataset before data preprocessing to use if needed at any time
originalDataset= dataset
No. of attributes: 11
Type of attributes: Ordinal , Nominal, and Numeric
No. of objects: 1247
Class label: salary_in_usd
ncol(dataset)
nrow(dataset)
names(dataset)
str(dataset)
| Attribute Name | Description | Data Type | Possible values |
|---|---|---|---|
| work_year | The year in which salary was recorded | Numerical | 2020 to 2022 |
| experience_level | Expertise level of the employee | Ordinal | En “Entry level” MI “Mid level” SE “Senior level” EX “Executive level” |
| employment_type | The nature or category of employee’s engagement in the job | Nominal | PT “Part time” FT “Full time” CT “Contract FL”Freelancer” |
| job_title | The role worked in during the year | Nominal | Different titles. like Security Analyst, security researcher |
| salary | The total gross salary amount paid | Numerical | 1740-50001566 |
| salary_currency | The currency of the salary paid to the employee | Nominal | Different currencies according to ISO 4217 currency code. like DE,CA |
| salary_in_usd | The salary paid in United states dollar | Numerical | 2000 to 365596.40 |
| employee_residence | Employee’s primary country of residence | Nominal | Different countries. like US,AE |
| remote_ratio | Percentage of online work by employee in the specified year | Numerical | 0 “No remote work” 50 “Partially remote” 100 “Fully remote” |
| company_location | The country of the employer’s main office | Nominal | Different countries. like BR,BW |
| company_size | How big/small is the company | Ordinal | S , M or L |
using sample_n(table,size) function and using (set_seed())
set.seed(30)
sample=sample_n(dataset,20)
print(sample)
if it is FALSE it means no null value,if it is TRUE there is null value. In our dataset there is no null values.
is.na(dataset)
sum(is.na(dataset))
The summary statistics for the dataset variables provide insights into the distribution of features. we can conclude the following: In work_year: The data spans from the year 2020 to 2022 with Most data falling within the years 2021 and 2022, as indicated by both the median and mean being centered around 2021. In salary: Salaries vary widely with a minimum of $1,740 and a maximum of $500 million. The median is $120,000 which is a mid value, but the mean is notably higher at $560,852 which might be duo to extreme values or notable skewness. In salary_in_usd: The data has a median of $110,000, and a mean of $120,278, and the spread of salaries is observable in the difference between the median and mean. In remote_ratio: Indicates the percentage of remote work ranging from 0% to 100%, with a median and 3rd quartile at 100%, and a mean of 71.49%, indicating a notable presence of remote work in the dataset, suggesting some variability.
summary(dataset$work_year)
summary(dataset$salary)
summary(dataset$salary_in_usd)
summary(dataset$remote_ratio)
By now, we concluded some of the main statistical measures for the numerical columns
variance is to understand the spread or dispersion of the values in each column. A higher variance indicates that the values are more spread out from the mean and in our dataset the highest attribute is salary, while a lower variance indicates that the values are closer to the mean which in our datas it is work year attribute.
var(dataset$work_year)
var(dataset$salary)
var(dataset$salary_in_usd)
var(dataset$remote_ratio)
Here we used boxplot to see the distribution between salary_in_usd and experience_level We observed that salaries vary depending on the level of experience,they are positively correlated.
boxplot(salary_in_usd ~ experience_level, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we used boxplot to see the distribution between salary_in_usd and work_year We observed that 2021 salaries were close to each other but in 2022 the gap between them getting bigger.
boxplot(salary_in_usd ~ work_year, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we used boxplot to see the distribution between salary_in_usd and employment_type We observed that Full Time (FT) offers more salary than the other categories.
boxplot(salary_in_usd ~ employment_type, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we used boxplot to see the distribution between salary_in_usd and company_size We observed that the larger the company is the higher the salary was.
boxplot(salary_in_usd ~ company_size, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
The “salary” column gives the same information as “salary_in_usd” it’s just a matter of currency exchange, and we will eventually transform all the values in “salary” column to one common currency so we can properly deal with them. To further confirm that the two column are redundant, we will use the latest exchange rate for USD to the desired currency.
we will start by creating a temporary column named “converted_salary” to save the salary that we will get by using the exchange rate to convert the salary_in_usd to the salary with different currencies to compare with “salary” column
convertedDataset=dataset
convertedDataset$exchange_rate = factor(convertedDataset$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1/1,1/0.20,1/1.22,1/1.06,1/0.012,1/0.74,1/1.10,1/0.14,1/0.73,1/0.64,1/0.090,1/0.057,1/0.26,1/0.23,1/0.093,1/0.000065,1/0.60,1/0.0027,1/0.053,1/0.031,1/0.010))
convertedDataset$exchange_rate = as.numeric(as.character(convertedDataset$exchange_rate))
convertedDataset$converted_salary = convertedDataset$salary_in_usd*convertedDataset$exchange_rate
set.seed(1)
salary_sample <- sample_n(convertedDataset[,c("salary","converted_salary")],10)
print(salary_sample)
as shown in the sample, the two columns are almost identical. This can be proved by correlation coefficient as well.
correlation <- cor(convertedDataset$salary , convertedDataset$converted_salary)
print(correlation)
The correlation is so high but it hasn’t reached 100% possibly due to rounding in the calculations and slight differences in the exchange rate over time.
To make the mining process more effiecent and has an improved quality, we decided to remove the “salary” column.
dataset<-dataset[,-c(5)]
We will show outliers with boxPlots and then remove them, to minimize noise and to get better analytical results when applying data mining techniques.
now we show (salary_in_usd) attributes’ outliers. we can see that there are many outliers with exceptionally high values, thus we will remove them.
boxplot(dataset$salary_in_usd)
OutSalary = outlier(dataset$salary_in_usd, logical =TRUE)
Find_outlier = which(OutSalary ==TRUE, arr.ind = TRUE)
dataset= dataset[-Find_outlier,]
now we show (remote_ratio) attributes’ outliers. we can see there aren’t outliers in remote_ratio, thus we don’t need the last step i.e: removing outliers’ rows.
boxplot(dataset$remote_ratio)
now we show (work_year) attributes’ outliers. we can see there aren’t outliers in work_year, thus we don’t need the last step i.e: removing outliers’ rows.
boxplot(dataset$work_year)
the columns “company_location” and “employee_residence” have the name of countries for the company and employee respectively. And these attributes can be generalized to higher-level concept that is region to help understand and analyze the dataset better and improve algorithm performance.
We will use the 7 regions as defined in the World Bank Development Indicators. These regions are:
East Asia and Pacific: This region includes countries like China, Australia, Indonesia, Thailand, etc.
Europe and Central Asia: This region includes countries like Germany, UK, Russia, Turkey, etc.
Latin America & Caribbean: This region includes countries like Brazil, Mexico, Argentina, Cuba, etc.
Middle East and North Africa: This region includes countries like Saudi Arabia, Egypt, Iran, Iraq, etc.
North America: This is predominantly United States and Canada.
South Asia: This region includes countries like India, Pakistan, Bangladesh, Sri Lanka, etc.
Sub-Saharan Africa: This region includes countries like Nigeria, South Africa, Ethiopia, Kenya, etc.
Note: UM(The United States Minor Outlying Islands) and AQ(Antarctica) don’t belong to any of these regions, thus, they will be used as they are.
um=which(dataset$company_location=="UM")
aq=which(dataset$company_location=="AQ")
dataset$company_location <- countrycode(dataset$company_location, "iso2c", "region")
dataset$employee_residence <- countrycode(dataset$employee_residence, "iso2c", "region")
dataset[um,"company_location"]="UM"
dataset[aq,"company_location"]="AQ"
Concept hierarchy generation can be done for “job_title” as well to improve interpretation and scalability. Also, most job titles are essentially the same job but with different names, so we can combine them into a higher-level jobs titles such as Architect, Analyst and Engineer etc.
## Create the categories based on job rank
dataset$job_title <- ifelse(grepl("Analyst", dataset$job_title), "Analyst",
ifelse(grepl("Architect", dataset$job_title), "Architect",
ifelse(grepl("Engineer", dataset$job_title), "Engineer",
ifelse(grepl("Manager|Officer|Director|Leader", dataset$job_title), "Leadership",
ifelse(grepl("Consultant|Specialist", dataset$job_title), "Consultant/Specialist",
ifelse(grepl("Cyber", dataset$job_title), "Cyber Security",
"Others"))))))
To deal with columns with character type we are going to encode them, because most machine learning algorithms are designed to work with factors data rather than character data and it improves performance and Interpretability of data as well.
dataset$job_title <- factor(dataset$job_title)
dataset$experience_level = factor(dataset$experience_level, levels=c("EN", "MI", "SE", "EX"), labels=c(1,2,3,4))
dataset$employment_type <- factor(dataset$employment_type)
dataset$employee_residence <- factor(dataset$employee_residence)
dataset$company_location <- factor(dataset$company_location)
dataset$salary_currency <- factor(dataset$salary_currency)
dataset$job_title <- factor(dataset$job_title)
dataset$company_size = factor(dataset$company_size, levels=c("S","M","L"), labels=c(1,2,3))
dataset$job_title <- factor(dataset$job_title)
by calculating breaks based on quartiles
breaks <- quantile(dataset$salary_in_usd,
probs = c(0, .25, .5, .75, .95, 1),
na.rm = TRUE)
dataset$salary_in_usd <- cut(dataset$salary_in_usd,
breaks = breaks,
include.lowest = TRUE,
labels=c("Very Low", "Low", "Medium", "High", "Very High"))
to change the scale of numeric attributes (remote_ratio and work_year) to a scale of [-1,1] to give them equal weight
dataset [, c("work_year" , "remote_ratio")] = scale(dataset [, c("work_year" , "remote_ratio")])
we will implement feature selection to remove redundant or irrelevant attributes from the data set to get the smallest subset that can help us get the most accurate predictions for our target class(salary_in_usd) and decrease the time that it takes the classifier to process the data.
we will use RFE(Recursive feature elimination) which is a wrapper method for the feature selection. Since the RFE function have multiple control options we need to specify the options that we want. We will choose “Random Forest” because it has high accuracy, can handle categorical data.
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
number = 10)
First we save the features to be used in the feature selection(every attributes except the class label “salary_in_usd”) in variable x, and the class label in variable y. Then split the data to 80% training and 20% test.
x <- dataset %>%
select(-salary_in_usd) %>%
as.data.frame()
# Target variable
y <- dataset$salary_in_usd
# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = .80, list = FALSE)[,1]
x_train <- x[ inTrain, ]
x_test <- x[-inTrain, ]
y_train <- y[ inTrain]
y_test <- y[-inTrain]
after splitting the data, now we can perform the selection using rfe
set.seed(1)
result_rfe1 <- rfe(x = x_train,
y = y_train,
sizes = c(1:9),
rfeControl = control)
result_rfe1
predictors(result_rfe1)
The results show that all the remaining attributes, except for “employment_type”, are selected. This is logical, as 98% of the rows have the value “FT”, as shown in the table below. Due to the low variance, we decided to remove this attribute.
table(dataset$employment_type)
dataset<-dataset[,-which(names(dataset)=="employment_type")]
During this phase, our focus will be on clustering and classification techniques to analyze the data. The primary objectives are to identify distinct groups within the dataset through clustering, classify data objects into meaningful categories, and apply different evaluation methods to assess the accuracy and precision of both classification and clustering results. We aim to gain deeper insights into the data and discover patterns.
# Read the CSV file from github
dataset2= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/preprocessedDataset.csv"), header=TRUE)
# Identify the character variables in the dataset2
char_vars <- sapply(dataset2, is.character)
# Convert the identified character variables in dataset2 to factors
dataset2[char_vars] <- lapply(dataset2[char_vars], as.factor)
To resolve the problem of class imbalance in the dataset, we will use SMOTE() method that oversample the minority class by creating synthetic samples using the existing minority class samples
set.seed(10)
balanced_dataset <- SMOTE(salary_in_usd ~ ., dataset2, perc.over = 300, perc.under=500, k = 10)
The goal of all preceding steps is to properly prepare the dataset for the classification phase, which constitutes one of our primary mining objectives. In this section, we will employ various attribute selection methods such as the Gini index, Gain ratio, and information gain to construct a decision tree model. We will thoroughly evaluate its performance, and if it proves effective, it can subsequently be utilized to classify new instances with unknown class labels.
since our dataset is small, we decided to use K-fold Cross-validation. for each attribute selection method we will try different K size (10,5, and 3).
in all this section we will be using train and trainControl functions of caret package to produce decision trees. for Gini index the method will be “rpart” and for Gain ratio it’s “j48” as for information gain the method is “C5.0”.
the following function will be used to compute average sensitivity and Specificity:
macro = function(matrix){
sumSen=0
for (i in 1:5) {
sumSen = sumSen + matrix$byClass[i,1]
}
avgSen = sumSen/5
sumSpec=0
for (i in 1:5) {
sumSpec = sumSpec + matrix$byClass[i,2]
}
avgSpec = sumSpec/5
sumPrec=0
for (i in 1:5) {
sumPrec = sumPrec + matrix$byClass[i,3]
}
avgPrec = sumPrec/5
avgs = data.frame(Sensitivity=avgSen , Specificity=avgSpec, Precision=avgPrec ,Accuracy= unname( matrix$overall[1]) )
print(avgs)
}
Gini index measures the impurity of the dataset. The partitioning that yields the most substantial reduction in impurity is selected as the splitting attribute. To apply the Gini index, we will employ the “rpart” method, which utilizes the Gini index as the criteria for splitting.
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))
giniIndex10 <- train(
salary_in_usd ~ .,
data = balanced_dataset,
method = "rpart",
trControl = ctrl,tuneGrid=tuneGrid,
control = list(
minsplit = 10,
minbucket = 5,
xval = 10,
cp = 0.0001
)
)
prp(giniIndex10$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
#####Confusion matrix of 10 folds using Gini Index
giniIndex10cm = caret::confusionMatrix(giniIndex10$pred$obs,giniIndex10$pred$pred)
giniIndex10cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 99 16 37 19 3
Low 28 130 42 1 69
Medium 56 64 88 15 14
Very_High 28 5 19 190 10
Very_Low 6 54 6 9 189
Overall Statistics
Accuracy : 0.5815
95% CI : (0.5529, 0.6096)
No Information Rate : 0.2381
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.4752
Mcnemar's Test P-Value : 0.01206
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.45622 0.4833 0.45833
Specificity 0.92347 0.8491 0.85174
Pos Pred Value 0.56897 0.4815 0.37131
Neg Pred Value 0.88465 0.8501 0.89167
Prevalence 0.18129 0.2247 0.16040
Detection Rate 0.08271 0.1086 0.07352
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.68985 0.6662 0.65504
Class: Very_High Class: Very_Low
Sensitivity 0.8120 0.6632
Specificity 0.9356 0.9178
Pos Pred Value 0.7540 0.7159
Neg Pred Value 0.9534 0.8971
Prevalence 0.1955 0.2381
Detection Rate 0.1587 0.1579
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8738 0.7905
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))
giniIndex5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart", trControl = ctrl,tuneGrid=tuneGrid,
control = list(
minsplit = 10,
minbucket = 5,
xval = 10,
cp = 0.0001
))
prp(giniIndex5$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)
NA
NA
#####Confusion matrix of 5 folds using Gini Index
giniIndex5cm = caret::confusionMatrix(giniIndex5$pred$obs,giniIndex5$pred$pred)
giniIndex5cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 98 10 43 16 7
Low 28 123 48 0 71
Medium 64 56 87 14 16
Very_High 25 7 22 192 6
Very_Low 4 53 10 7 190
Overall Statistics
Accuracy : 0.5764
95% CI : (0.5479, 0.6046)
No Information Rate : 0.2423
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4692
Mcnemar's Test P-Value : 0.001289
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.44749 0.4940 0.41429
Specificity 0.92229 0.8449 0.84802
Pos Pred Value 0.56322 0.4556 0.36709
Neg Pred Value 0.88172 0.8641 0.87188
Prevalence 0.18296 0.2080 0.17544
Detection Rate 0.08187 0.1028 0.07268
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.68489 0.6695 0.63116
Class: Very_High Class: Very_Low
Sensitivity 0.8384 0.6552
Specificity 0.9380 0.9184
Pos Pred Value 0.7619 0.7197
Neg Pred Value 0.9608 0.8928
Prevalence 0.1913 0.2423
Detection Rate 0.1604 0.1587
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8882 0.7868
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))
giniIndex3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart", trControl = ctrl,tuneGrid=tuneGrid,
control = list(
minsplit = 10,
minbucket = 5,
xval = 10,
cp = 0.0001
))
prp(giniIndex3$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)
NA
NA
#####Confusion matrix of 3 folds using Gini Index
giniIndex3cm = caret::confusionMatrix(giniIndex3$pred$obs,giniIndex3$pred$pred)
giniIndex3cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 99 9 39 19 8
Low 26 114 48 9 73
Medium 62 46 86 20 23
Very_High 31 6 20 189 6
Very_Low 8 50 9 14 183
Overall Statistics
Accuracy : 0.5606
95% CI : (0.5319, 0.5889)
No Information Rate : 0.2448
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4498
Mcnemar's Test P-Value : 0.0006718
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.43805 0.50667 0.42574
Specificity 0.92276 0.83951 0.84824
Pos Pred Value 0.56897 0.42222 0.36287
Neg Pred Value 0.87586 0.88026 0.87917
Prevalence 0.18881 0.18797 0.16876
Detection Rate 0.08271 0.09524 0.07185
Detection Prevalence 0.14536 0.22556 0.19799
Balanced Accuracy 0.68041 0.67309 0.63699
Class: Very_High Class: Very_Low
Sensitivity 0.7530 0.6246
Specificity 0.9334 0.9104
Pos Pred Value 0.7500 0.6932
Neg Pred Value 0.9344 0.8821
Prevalence 0.2097 0.2448
Detection Rate 0.1579 0.1529
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8432 0.7675
All three trees seem to be alike in their arrangement and form.
rbind("10 Folds"=macro(giniIndex10cm), "5 Folds"=macro(giniIndex5cm), "3 Folds"=macro(giniIndex3cm) )
The higher values for sensitivity, specificity, precision, and accuracy in the 10-fold case indicate better overall performance according to these metrics. so,Gini Index model performs better with a 10-fold cross-validation compared to 5 and 3 folds.
The gain ratio, a normalized measure of information gain, is calculated by dividing information gain by the split information. The attribute that yields the highest gain ratio is chosen as the splitting attribute. The C4.5 algorithm employs the gain ratio.
The J48 is the Java-based open-source implementation of the C4.5 algorithm, and it is included in the Weka package. This implementation allows users to conveniently apply the C4.5 decision tree.
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
gainRatio10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio10$finalModel)
#####Confusion matrix of 10 folds using Gain ratio
gainRatio10cm = caret::confusionMatrix(gainRatio10$pred$obs, gainRatio10$pred$pred)
gainRatio10cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 100 18 32 21 3
Low 28 148 40 3 51
Medium 53 49 115 14 6
Very_High 28 6 18 194 6
Very_Low 2 39 11 5 207
Overall Statistics
Accuracy : 0.6383
95% CI : (0.6103, 0.6655)
No Information Rate : 0.2281
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5465
Mcnemar's Test P-Value : 0.167
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.47393 0.5692 0.53241
Specificity 0.92495 0.8698 0.87564
Pos Pred Value 0.57471 0.5481 0.48523
Neg Pred Value 0.89150 0.8792 0.89479
Prevalence 0.17627 0.2172 0.18045
Detection Rate 0.08354 0.1236 0.09607
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.69944 0.7195 0.70402
Class: Very_High Class: Very_Low
Sensitivity 0.8186 0.7582
Specificity 0.9396 0.9383
Pos Pred Value 0.7698 0.7841
Neg Pred Value 0.9545 0.9293
Prevalence 0.1980 0.2281
Detection Rate 0.1621 0.1729
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8791 0.8483
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
gainRatio5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio5$finalModel)
#####Confusion matrix of 5 folds using Gain ratio
gainRatio5cm=caret::confusionMatrix(gainRatio5$pred$obs, gainRatio5$pred$pred)
gainRatio5cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 102 21 34 16 1
Low 31 148 38 1 52
Medium 56 46 103 15 17
Very_High 30 7 18 194 3
Very_Low 3 42 10 9 200
Overall Statistics
Accuracy : 0.6241
95% CI : (0.5959, 0.6516)
No Information Rate : 0.2281
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5289
Mcnemar's Test P-Value : 0.007667
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.45946 0.5606 0.50739
Specificity 0.92615 0.8692 0.86519
Pos Pred Value 0.58621 0.5481 0.43460
Neg Pred Value 0.88270 0.8749 0.89583
Prevalence 0.18546 0.2206 0.16959
Detection Rate 0.08521 0.1236 0.08605
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.69281 0.7149 0.68629
Class: Very_High Class: Very_Low
Sensitivity 0.8255 0.7326
Specificity 0.9397 0.9307
Pos Pred Value 0.7698 0.7576
Neg Pred Value 0.9566 0.9218
Prevalence 0.1963 0.2281
Detection Rate 0.1621 0.1671
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8826 0.8317
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
gainRatio3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio3$finalModel)
gainRatio3cm=caret::confusionMatrix(gainRatio3$pred$obs, gainRatio3$pred$pred)
gainRatio3cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 94 18 39 19 4
Low 25 129 39 3 74
Medium 47 47 110 19 14
Very_High 14 4 27 200 7
Very_Low 3 32 9 12 208
Overall Statistics
Accuracy : 0.619
95% CI : (0.5909, 0.6467)
No Information Rate : 0.2565
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5216
Mcnemar's Test P-Value : 0.007322
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.51366 0.5609 0.4911
Specificity 0.92110 0.8542 0.8695
Pos Pred Value 0.54023 0.4778 0.4641
Neg Pred Value 0.91300 0.8910 0.8812
Prevalence 0.15288 0.1921 0.1871
Detection Rate 0.07853 0.1078 0.0919
Detection Prevalence 0.14536 0.2256 0.1980
Balanced Accuracy 0.71738 0.7075 0.6803
Class: Very_High Class: Very_Low
Sensitivity 0.7905 0.6775
Specificity 0.9449 0.9371
Pos Pred Value 0.7937 0.7879
Neg Pred Value 0.9439 0.8939
Prevalence 0.2114 0.2565
Detection Rate 0.1671 0.1738
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8677 0.8073
The observed structure of the three decision trees seems to be the same and it can be summarized as follows:
The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.
rbind("10 Folds"=macro(gainRatio10cm), "5 Folds"=macro(gainRatio5cm), "3 Folds"=macro(gainRatio3cm) )
Based on the evaluation metrics of average Sensitivity,Precision ,Specificity, and Accuracy, it is evident that the gain ratio model, built using a 10-fold cross-validation approach, exhibits superior performance compared to the other two models. However, it’s worth noting that the difference in performance between the models is relatively small.
A detailed examination of the results from the 10-fold cross-validation reveals that the model has a notably high specificity compared to other metrics. This high specificity suggests that the model is particularly effective at correctly identifying instances that do not pertain to the target class—essentially, it accurately recognizes when examples are not members of the specified class. For example, if the positive class in question is “High” then the model is able to correctly classify tuples that belong to “Very Low”, “Medium”, and “Very High”.
However, possessing high specificity alone does not guarantee the overall effectiveness of the model, as a well-rounded model also requires balanced performance across other metrics. In this case, its ability to capture and classify instances that do belong to the positive class (as measured by sensitivity) is not as robust. For a model to be considered truly effective, it would need to demonstrate strong performance in all metrics specificity and sensitivity, ensuring it can accurately distinguish both negative and positive instances as well as accuracy precision.
Information Gain is a metric used to decide which attribute to choose for splitting the data at each node in the decision tree. For a given dataset, the Information Gain of an attribute is calculated by comparing the entropy before and after the dataset is split based on that attribute. The attribute with the highest Information Gain is chosen as the splitting attribute.
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
infoGain10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 3 for this object. Predictions generated using 3 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trials
c5model <- C5.0(salary_in_usd ~ .,
data = balanced_dataset,
trials = infoGain10$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain10$bestTune$winnow))
plot(c5model)
infoGain10cm= caret::confusionMatrix(infoGain10$pred$obs, infoGain10$pred$pred)
infoGain10cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 90 19 40 20 5
Low 24 127 42 10 67
Medium 51 58 96 19 13
Very_High 17 3 15 208 9
Very_Low 3 43 4 10 204
Overall Statistics
Accuracy : 0.6057
95% CI : (0.5773, 0.6335)
No Information Rate : 0.249
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.5046
Mcnemar's Test P-Value : 0.03427
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.48649 0.5080 0.4873
Specificity 0.91700 0.8490 0.8590
Pos Pred Value 0.51724 0.4704 0.4051
Neg Pred Value 0.90714 0.8673 0.8948
Prevalence 0.15455 0.2089 0.1646
Detection Rate 0.07519 0.1061 0.0802
Detection Prevalence 0.14536 0.2256 0.1980
Balanced Accuracy 0.70174 0.6785 0.6732
Class: Very_High Class: Very_Low
Sensitivity 0.7790 0.6846
Specificity 0.9527 0.9333
Pos Pred Value 0.8254 0.7727
Neg Pred Value 0.9376 0.8992
Prevalence 0.2231 0.2490
Detection Rate 0.1738 0.1704
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8659 0.8089
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
infoGain5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trials
c5model <- C5.0(salary_in_usd ~ .,
data = balanced_dataset,
trials = infoGain5$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain5$bestTune$winnow))
plot(c5model)
infoGain5cm = caret::confusionMatrix(infoGain5$pred$obs, infoGain5$pred$pred)
infoGain5cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 85 20 47 16 6
Low 25 129 38 10 68
Medium 37 70 99 14 17
Very_High 16 4 13 210 9
Very_Low 2 37 9 13 203
Overall Statistics
Accuracy : 0.6065
95% CI : (0.5782, 0.6343)
No Information Rate : 0.2531
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5049
Mcnemar's Test P-Value : 0.001691
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.51515 0.4962 0.48058
Specificity 0.91376 0.8495 0.86075
Pos Pred Value 0.48851 0.4778 0.41772
Neg Pred Value 0.92180 0.8587 0.88854
Prevalence 0.13784 0.2172 0.17210
Detection Rate 0.07101 0.1078 0.08271
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.71446 0.6728 0.67066
Class: Very_High Class: Very_Low
Sensitivity 0.7985 0.6700
Specificity 0.9550 0.9318
Pos Pred Value 0.8333 0.7689
Neg Pred Value 0.9439 0.8928
Prevalence 0.2197 0.2531
Detection Rate 0.1754 0.1696
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8768 0.8009
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
infoGain3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trials
c5model <- C5.0(salary_in_usd ~ .,
data = balanced_dataset,
trials = infoGain3$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain3$bestTune$winnow))
plot(c5model)
infoGain3cm = caret::confusionMatrix(infoGain3$pred$obs, infoGain3$pred$pred)
infoGain3cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 85 25 40 19 5
Low 19 137 37 5 72
Medium 55 64 93 11 14
Very_High 16 8 22 197 9
Very_Low 4 43 11 9 197
Overall Statistics
Accuracy : 0.5923
95% CI : (0.5639, 0.6203)
No Information Rate : 0.2481
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.4874
Mcnemar's Test P-Value : 0.01149
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.47486 0.4946 0.45813
Specificity 0.91257 0.8554 0.85513
Pos Pred Value 0.48851 0.5074 0.39241
Neg Pred Value 0.90811 0.8490 0.88542
Prevalence 0.14954 0.2314 0.16959
Detection Rate 0.07101 0.1145 0.07769
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.69372 0.6750 0.65663
Class: Very_High Class: Very_Low
Sensitivity 0.8174 0.6633
Specificity 0.9425 0.9256
Pos Pred Value 0.7817 0.7462
Neg Pred Value 0.9534 0.8928
Prevalence 0.2013 0.2481
Detection Rate 0.1646 0.1646
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8799 0.7944
The observed structure of the three decision trees seems to be the same and it can be summarized as follows:
Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the “experience level.” This divides the tree into two main branches or subtrees:
Within the right subtree: In the right sub tree if the experience level is 4(EX) the tree will be divided based on “Company location”
Within the left subtree: In the left subtree it will divide the tree for both two experience levels 1(EN) and 2(MI) based on “employee residence” and when the “employee residence” is “North America” the tree will be further divided based on “salary currency” and when this attribute is equal to “USD” the division will be based on the “job title” attribute
The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.
rbind("10 Folds"=macro(infoGain10cm), "5 Folds"=macro(infoGain5cm), "3 Folds"=macro(infoGain3cm) )
Based on the provided sensitivity, specificity, precision, and accuracy values there isn’t a clear indication of the superiority of one fold over another for Information Gain model .we may need to consider additional factors or conduct further analysis to make a well-informed decision. as can be seen in the table the 10 folds has the best Specificity and Precision, meanwhile the 5 folds has the best Sensitivity and Accuracy.
The gain ratio evaluated with 10-fold cross-validation appears to has the best performance among all the decision tree models. This might be due to the fact that the gain ratio tends to favor unbalanced splits, where one partition is significantly smaller than others. In our dataset, if an attribute has a rare value, the gain ratio might prioritize splitting on this attribute despite the resulting unbalance.
Despite the superiority of the 10-fold, the performance metrics of all three fold sizes (3-fold, 5-fold, and 10-fold) using Gini index, gain ratio, and information gain are relatively similar, suggesting that all three measures are robust within the context of this dataset. A likely contributing factor to this performance consistency is the balanced distribution of class labels in the dataset. When classes are balanced, each splitting criterion is more or less equally likely to encounter informative splits, which serves to decrease performance variability across different splitting methods.
Data clustering is a process to partition data into groups or clusters,it is an unsupervised learning process, which is excuted without knowing the class label of the training data. The data in the same group “cluster” are similar to one another and different from data in other clusters. we will utilize k-means clustering.
we will encode the rest of factor columns to transform them into numeric types before clustering, enabling meaningful distance calculations using kmeans and other formulas. we will also remove the class label from the dataset as clustering is an unsupervised learning process, and we will preserve this class label in an attribute for later use. lastly, we will scale all numeric attributes in the dataset so they will be standarized.
# view data
dataset3 <- dataset2
View(dataset3)
# Reserve the salary_in_usd (the class label) column in an attribute before removing it from the dataset for clustering
classLabel <- dataset3[, 5]
# Remove the class lable from the dataset
dataset3 <- dataset3[, -5]
# encoding job_title variable
dataset3$job_title = factor(dataset3$job_title, levels=c("Analyst", "Architect", "Engineer", "Leadership", "Consultant/Specialist","Cyber Security","Others" ), labels=c(4,1,2,5,3,6,7))
# encoding salary_currency variable
dataset3$salary_currency = factor(dataset3$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21))
# encoding employee_residence variable
dataset3$employee_residence = factor(dataset3$employee_residence, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa"), labels=c(1,2,3,4,5,6,7))
# encoding company_location variable
dataset3$company_location = factor(dataset3$company_location, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa", "AQ", "UM"), labels=c(1,2,3,4,5,6,7,8,9))
#Data types should be transformed into numeric types before clustering
#Transforming all non-numeric attributes to numeric type
dataset3$experience_level <- as.numeric(as.character(dataset3$experience_level))
dataset3$job_title <- as.numeric(as.character(dataset3$job_title))
dataset3$salary_currency <- as.numeric(as.character(dataset3$salary_currency))
dataset3$employee_residence <- as.numeric(as.character(dataset3$employee_residence))
dataset3$company_location <- as.numeric(as.character(dataset3$company_location))
dataset3$company_size <- as.numeric(as.character(dataset3$company_size))
# viwe the class of attributes to ensure they have transformed to numeric
sapply(dataset3, class)
work_year experience_level job_title
"numeric" "numeric" "numeric"
salary_currency employee_residence remote_ratio
"numeric" "numeric" "numeric"
company_location company_size
"numeric" "numeric"
#scale all attributes in the dataset so they would be standardized
dataset3 <- scale(dataset3)
now all columns are numeric and scaled, so we are ready to perform k-means clustering
After preprocessing the data, now we are ready to perform the clustering process, we will use the k-means clustering, it is a clustering method that aims to minimize the sum of squared distances between each data point and the centroid of its assigned cluster by iteratively updating cluster assignments and centroids.
We will choose 3 different numbers to perform the k-means clustering on, one of the numbers should be relatevily large, the second should be in the middle and the last should be small. This way we will cover the possible outcomes and clustering results.
Now we will apply Silhouette method to find the average Silhouette coefficient and optimal number of clusters k, we will also plot a graph where x-axis represent the number of clusters and y-axis represent the average Silhouette coefficient
fviz_nbclust(dataset3, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
as seen by the graph, the number of clusters k that maximizes the average Silhouette coefficient is 2, so we will use it for clustering.
This method determines the number of clusters according to the turning point in a curve, the curve is plotted using the total within-cluster sum of square (WSS) as in y-axis , and No. clusters in x-axis
fviz_nbclust(dataset3, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
As shown, the number of clusters k that represents the turning point in the curve is 4, so we will use it for clustering.
Lastly, we will use k=3 since it acheives the second highest average Silhouette coefficient, and since it’s in the middle between 2 and 4 it will strike a balance between having too few clusters (k=2), and having several clusters (k=4), Thus, this choice will have an acceptable acuuracy.
In this section, we will perform k-means clustering and visualize its result using three different k’s that have been chosen beforehand, then we will compute WSS and Bcubed preceision and recall and average silhouette for each cluster as methods of evaluating clustering results.
#Use seed to guarantee replicability of random processes
set.seed(8953)
# run k-means clustering to find 2 clusters
kmeans.result <- kmeans(dataset3, 2)
# print the clusterng result
kmeans.result
K-means clustering with 2 clusters of sizes 938, 308
Cluster means:
work_year experience_level job_title salary_currency
1 0.1123321 0.07766291 -0.05252131 -0.3315729
2 -0.3421024 -0.23651887 0.15995128 1.0097901
employee_residence remote_ratio company_location
1 -0.5368064 0.08024947 -0.5259331
2 1.6348196 -0.24439612 1.6017052
company_size
1 0.0410645
2 -0.1250601
Clustering vector:
[1] 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 2 1 2
[29] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
[57] 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1
[85] 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
[113] 1 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1
[141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
[169] 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
[197] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[225] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
[253] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[309] 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[337] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1
[365] 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 2 2 1 1 2
[393] 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1
[421] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 1 2 1 1 2
[449] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2
[477] 1 1 1 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 1 2 1 2 2 1 1 1 1
[505] 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 2 2
[533] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[561] 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
[589] 1 1 2 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
[617] 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[645] 1 2 1 1 1 1 1 1 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
[673] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 2 1 2 1 1
[701] 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
[729] 1 1 1 1 1 2 2 2 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1 2 2
[757] 2 2 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2
[785] 2 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 2 1 1 2 1 1
[813] 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[841] 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 1 1 2
[869] 2 1 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
[897] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1
[925] 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1
[953] 2 2 2 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
[981] 1 1 2 2 1 2 2 2 2 2 1 1 1 1 1 2 1 1 1 2
[ reached getOption("max.print") -- omitted 246 entries ]
Within cluster sum of squares by cluster:
[1] 4608.187 2679.470
(between_SS / total_SS = 26.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)
#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 7287.657
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
# Function to calculate BCubed precision and recall
bcubed <- function(data) {
n <- nrow(data)
total_precesion <- 0
total_recall <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Count the number of items from the same category within the same cluster
intersection <- sum(data$label[data$cluster == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Count the total number of items with the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
total_precesion <- total_precesion + intersection /total_same_cluster
total_recall <- total_recall + intersection / total_same_category
}
# Calculate average precision and recall
precision <- total_precesion / n
recall <- total_recall / n
return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- bcubed(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.2814606
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.7072104
we can conclude from the graph and the results that this k=2 is possible to be the optimal k, since there is no overlapping between the two clusters, the data in a cluster are close “similar” to each other and dissimilar to data in the other cluster. Also, the recall is relatively high (0.71) and is the highest among the k’s chosen, the Precision is intermediate (0.28) which could be duo to presence of outliers or sensitivity to Initial Centroid. We can also note that the WSS is 26.8%, which is an excellent percentage indicating a good compactness of clusters, and that objects in a cluster are similar to one another. Lastly, the average silhouette width is 0.34 which is cnsidered high.
#Use seed to guarantee replicability of random processes
set.seed(8953)
# run k-means clustering to find 3 clusters
kmeans.result <- kmeans(dataset3, 3)
# print the clusterng result
kmeans.result
K-means clustering with 3 clusters of sizes 475, 302, 469
Cluster means:
work_year experience_level job_title salary_currency
1 -0.3742571 -0.4628902 -0.004526137 -0.3360410
2 -0.3447469 -0.2440022 0.169029312 1.0307121
3 0.6010356 0.6259307 -0.104257862 -0.3233595
employee_residence remote_ratio company_location
1 -0.5399390 -0.4866784 -0.5233822
2 1.6375401 -0.2426204 1.6311505
3 -0.5076036 0.6491335 -0.5202578
company_size
1 0.2231005
2 -0.1312858
3 -0.1414167
Clustering vector:
[1] 1 1 1 1 2 3 3 3 2 2 1 3 3 1 3 3 3 1 1 2 2 2 2 1 1 2 3 2
[29] 3 3 1 3 1 1 1 1 3 3 3 3 1 3 3 1 3 3 1 3 3 3 2 1 1 1 3 1
[57] 1 2 3 2 3 3 2 2 1 3 3 3 3 3 3 3 3 3 3 2 1 1 2 3 3 3 3 3
[85] 3 3 3 3 2 3 3 2 1 2 3 3 3 3 3 3 2 1 3 3 3 3 3 3 3 1 3 3
[113] 3 2 2 1 1 1 1 3 1 2 2 2 2 2 3 3 3 3 3 3 1 3 3 2 2 2 2 3
[141] 3 1 1 3 3 3 1 1 3 3 3 3 3 3 3 2 3 3 1 1 1 1 3 3 3 1 1 3
[169] 3 3 3 3 3 1 1 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 2
[197] 2 2 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3
[225] 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 1 1 1 1 2 2 3 3 3 3 3 3
[253] 3 3 3 3 1 1 1 1 1 1 3 3 2 2 3 3 3 3 3 3 1 1 3 3 3 3 3 3
[281] 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3 3 3
[309] 1 1 3 3 3 3 3 3 3 3 2 2 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3
[337] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 3 3 2 3 2 1 3 3 3 3 3 3 3 3
[365] 1 1 1 2 2 3 3 3 1 3 3 3 1 3 3 2 2 1 3 2 3 3 3 2 2 3 3 2
[393] 2 1 3 1 3 3 3 2 2 3 1 1 3 1 3 3 3 3 1 3 3 1 3 1 1 2 2 3
[421] 3 3 1 3 3 3 3 1 3 1 3 1 2 3 2 1 2 2 3 2 3 3 3 1 2 1 3 2
[449] 3 1 1 1 1 3 3 1 1 2 1 3 3 3 3 2 1 3 1 3 1 3 1 1 1 2 2 2
[477] 1 1 1 3 2 2 2 1 1 3 2 3 1 2 2 3 2 2 3 1 2 3 2 2 3 1 1 1
[505] 1 3 3 3 3 2 1 2 3 2 3 1 3 2 1 3 2 3 3 3 2 1 3 1 3 3 2 2
[533] 2 2 3 3 3 3 1 3 3 1 1 3 1 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1
[561] 3 3 1 3 3 3 3 2 3 1 2 2 3 3 1 1 1 3 3 3 3 1 1 1 1 1 2 3
[589] 1 1 2 2 1 2 3 3 2 2 1 3 3 3 3 1 3 3 3 3 1 1 3 3 3 2 2 2
[617] 2 3 3 3 3 2 2 2 2 2 1 1 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3
[645] 3 2 3 1 1 3 3 3 3 2 1 2 3 1 2 1 2 3 3 3 1 1 3 3 3 1 3 1
[673] 3 1 1 3 1 1 1 1 3 1 3 1 1 2 2 1 1 2 1 1 1 1 1 2 1 2 1 1
[701] 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 3 2 3 1 1 1 3 1 1 3 1 1
[729] 1 1 1 1 3 1 2 2 1 3 1 1 3 2 1 2 1 1 1 2 3 1 1 1 1 1 2 2
[757] 2 2 2 3 3 3 1 2 2 1 1 2 2 3 1 3 1 3 3 3 1 1 3 1 2 1 1 2
[785] 2 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 2 3 1 2 3 1
[813] 1 3 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 3 1 1 1 1 1 3 3 1
[841] 2 1 1 1 1 1 1 3 1 1 1 3 3 2 3 1 1 1 2 1 1 1 1 2 1 3 1 2
[869] 2 1 1 1 2 1 2 1 3 2 1 1 2 1 3 1 1 1 1 1 1 1 1 2 1 3 1 3
[897] 1 1 1 3 1 1 2 3 3 1 1 1 1 1 1 1 1 1 3 3 3 1 2 2 3 1 2 1
[925] 3 2 3 1 1 1 1 1 1 1 2 1 3 1 1 1 3 1 3 1 1 2 1 1 3 2 2 1
[953] 2 2 2 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 3 3 1 1 2 1 1 1
[981] 1 1 2 2 1 2 2 2 2 2 1 1 1 1 1 2 1 1 1 2
[ reached getOption("max.print") -- omitted 246 entries ]
Within cluster sum of squares by cluster:
[1] 2262.211 2615.908 1573.419
(between_SS / total_SS = 35.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)
#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 6451.538
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
# Function to calculate BCubed precision and recall
bcubed <- function(data) {
n <- nrow(data)
total_precesion <- 0
total_recall <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Count the number of items from the same category within the same cluster
intersection <- sum(data$label[data$cluster == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Count the total number of items with the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
total_precesion <- total_precesion + intersection /total_same_cluster
total_recall <- total_recall + intersection / total_same_category
}
# Calculate average precision and recall
precision <- total_precesion / n
recall <- total_recall / n
return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- bcubed(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.3059101
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.4537324
we can conclude from the graph and the results where k=3 is that the performance is good but worse than k=2, because there is overlapping between clusters. Also, the recall is relatively high (0.68), However, the Precision is intermediate (0.28) which could be duo to presence of outliers or sensitivity to Initial Centroid. We can also note that the WSS is 32%, which is a good percentage indicating an intermidiate compactness of clusters, and that objects in a cluster are to some extent similar to one another. Lastly, the average silhouette width is 0.31 which is good.
#Use seed to guarantee replicability of random processes
set.seed(8953)
# run k-means clustering to find 4 clusters
kmeans.result <- kmeans(dataset3, 4)
# print the clusterng result
kmeans.result
K-means clustering with 4 clusters of sizes 291, 259, 48, 648
Cluster means:
work_year experience_level job_title salary_currency
1 -0.2065990 -0.26057796 -0.07392709 -0.3742177
2 -0.2746520 -0.26852805 0.08713240 0.4713219
3 -0.7039739 -0.05919153 0.57502683 3.9452389
4 0.2547005 0.22873170 -0.04422191 -0.3125717
employee_residence remote_ratio company_location
1 -0.5494700 -1.2790118 -0.5320764
2 1.6082647 -0.1776153 1.5917788
3 1.7560475 -0.5190540 1.7138693
4 -0.5261344 0.6838108 -0.5242318
company_size
1 0.03437756
2 -0.13081536
3 -0.07831572
4 0.04264886
Clustering vector:
[1] 1 1 1 1 2 4 4 4 2 2 1 4 4 1 4 4 4 4 4 2 2 2 2 1 1 2 4 2
[29] 4 4 4 4 1 1 4 1 4 4 4 4 1 4 4 1 4 4 1 4 4 4 2 1 1 1 1 1
[57] 1 3 4 2 4 4 2 2 1 4 4 4 4 4 4 4 4 4 4 2 4 4 2 4 4 4 4 4
[85] 4 4 4 4 3 4 4 3 1 2 4 4 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4
[113] 4 2 2 4 4 4 4 4 4 2 2 2 2 2 1 4 4 4 4 4 4 4 4 2 2 2 2 4
[141] 4 1 1 4 4 4 4 1 4 4 4 4 4 4 4 2 4 4 4 4 1 1 4 4 4 1 1 4
[169] 4 4 4 4 4 1 1 4 4 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 1 1 2 2
[197] 2 2 4 4 4 4 1 1 4 4 4 4 4 4 4 4 4 4 1 1 4 4 4 4 4 4 4 4
[225] 4 4 4 4 4 4 4 4 1 1 1 1 1 1 4 4 1 1 1 1 2 2 4 4 4 4 4 4
[253] 4 4 4 4 1 1 1 1 1 1 4 4 2 2 4 4 4 4 4 4 1 1 4 4 4 4 4 4
[281] 4 4 4 4 1 1 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 4 4 4 4 4 4
[309] 1 1 4 4 4 4 4 4 4 4 2 2 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4
[337] 4 4 4 4 4 4 4 4 4 4 1 1 1 1 4 4 2 4 2 4 4 4 4 4 4 4 4 4
[365] 4 1 1 2 2 4 4 4 1 4 4 4 4 4 4 2 2 1 4 2 4 4 4 2 2 4 4 2
[393] 2 4 4 1 4 4 4 2 2 4 1 1 4 1 4 4 4 4 1 4 4 1 4 4 4 2 2 4
[421] 4 4 1 4 4 4 4 1 4 1 4 1 2 4 2 1 2 2 4 2 4 4 4 1 2 1 4 2
[449] 4 1 1 1 1 4 4 1 1 2 4 4 4 4 4 3 4 4 4 4 4 4 1 1 1 2 2 3
[477] 1 1 1 4 2 2 2 1 4 4 2 4 1 2 2 4 3 2 4 1 3 4 2 2 4 4 4 1
[505] 4 4 4 4 4 2 4 2 4 2 4 4 4 2 4 4 2 4 4 4 2 1 4 4 4 4 2 2
[533] 3 2 4 4 4 4 1 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[561] 4 4 4 1 4 4 4 2 4 4 2 2 4 4 1 1 1 4 4 4 4 4 1 1 4 1 2 4
[589] 4 1 2 2 4 2 4 4 3 2 4 4 4 4 4 4 4 4 4 4 4 1 4 2 4 2 2 3
[617] 2 1 4 4 4 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[645] 4 3 4 1 1 4 4 4 4 2 4 2 4 4 2 1 2 4 4 4 1 4 4 4 4 4 4 4
[673] 4 1 1 4 1 4 1 4 4 1 4 1 1 2 3 1 1 2 4 1 4 1 4 2 1 2 1 4
[701] 4 4 4 2 2 3 1 1 1 4 1 1 4 4 1 4 4 2 4 4 1 1 4 1 1 4 1 4
[729] 4 4 1 4 4 2 2 2 4 4 4 4 4 2 4 2 1 4 1 2 4 1 1 4 4 1 2 2
[757] 2 2 2 4 4 4 1 2 2 4 1 2 2 4 4 4 1 4 4 4 1 1 4 4 2 1 1 2
[785] 2 2 4 1 1 3 1 3 1 1 1 3 1 4 1 4 2 2 1 4 1 1 3 4 1 2 4 4
[813] 4 4 2 4 1 4 4 4 2 1 4 2 4 4 2 1 1 1 1 4 4 4 1 1 1 4 4 4
[841] 3 1 4 1 1 1 4 4 1 4 4 4 4 3 4 1 1 1 2 1 1 1 2 2 4 4 1 3
[869] 3 4 1 1 2 1 2 4 4 2 1 1 2 1 4 4 4 1 1 1 1 4 4 2 1 4 4 4
[897] 4 4 1 4 1 4 2 4 4 4 4 1 1 4 1 4 4 1 4 4 4 4 2 2 4 1 2 1
[925] 4 3 4 4 1 1 4 4 4 4 2 4 4 1 4 4 4 4 4 1 1 2 1 4 4 2 2 1
[953] 2 2 3 3 4 2 3 4 2 4 4 1 1 1 4 1 1 1 2 1 4 4 1 4 2 1 1 4
[981] 1 1 2 2 4 3 3 2 2 3 1 1 1 1 1 2 4 4 4 2
[ reached getOption("max.print") -- omitted 246 entries ]
Within cluster sum of squares by cluster:
[1] 1195.7362 1730.4906 422.4286 2562.3908
(between_SS / total_SS = 40.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)
#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 5911.046
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
# Function to calculate BCubed precision and recall
bcubed <- function(data) {
n <- nrow(data)
total_precesion <- 0
total_recall <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Count the number of items from the same category within the same cluster
intersection <- sum(data$label[data$cluster == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Count the total number of items with the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
total_precesion <- total_precesion + intersection /total_same_cluster
total_recall <- total_recall + intersection / total_same_category
}
# Calculate average precision and recall
precision <- total_precesion / n
recall <- total_recall / n
return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- bcubed(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.2886446
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.4398836
we can conclude from the graph and the results where k=4 is that the performance is worse than k=2 and k=3, because there is a noticeable overlapping between clusters. Also, the clusers’ space is pretty wide which results in a large distance between objects in the same cluster. Also, the recall is relatively low (0.40) which might be a result of the overlapping and large distances between data objects. In addition, the Precision is intermediate (0.28) which could be duo to presence of outliers or sensitivity to Initial Centroid. We can also note that the WSS is 40.7%, which is an acceptable percentage indicating a lower compactness of clusters. Lastly, the average silhouette width is 0.2 which is low.